Following the Perturbed Leader to Gamble at Multi-armed Bandits
Authors
Abstract
Following the perturbed leader (fpl) is a powerful technique for solving online decision problems; Kalai and Vempala [1] recently rediscovered the algorithm. A traditional model for online decision problems is the multi-armed bandit: at each round a gambler has to choose one of k levers to pull, with the aim of minimizing the cumulative cost. There are four versions of the nonstochastic optimization setting, and the most demanding one is a game played against an adaptive adversary in the bandit setting. An adaptive adversary may alter its strategy of assigning costs to decisions depending on the decisions the gambler has chosen in the past. In the bandit setting the gambler only learns the cost of the choice he made, rather than the costs of all available alternatives. In this work we show that the very straightforward and easy-to-implement algorithm Adaptive Bandit fpl can attain a regret of O(√(T ln T)) against an adaptive adversary. This regret holds with respect to the best lever in hindsight and matches the previous best regret bounds of O(√(T ln T)).
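To make the scheme concrete, here is a minimal sketch of a bandit-style fpl loop, assuming exponential perturbations and importance-weighted cost estimates built on uniform exploration rounds; the names (get_cost, gamma, eta) are illustrative, and this is a sketch under those assumptions, not the paper's exact Adaptive Bandit fpl estimator.

```python
import random

def bandit_fpl(get_cost, k, T, gamma=0.1, eta=1.0, seed=0):
    """Follow-the-perturbed-leader loop for a k-armed bandit.
    get_cost(t, arm) reveals only the pulled arm's cost in [0, 1]."""
    rng = random.Random(seed)
    est = [0.0] * k                      # estimated cumulative costs
    total = 0.0
    for t in range(T):
        if rng.random() < gamma:
            # exploration round: uniform arm, importance-weighted
            # estimate so that E[estimate] equals the true cost
            arm = rng.randrange(k)
            c = get_cost(t, arm)
            est[arm] += c * k / gamma
        else:
            # follow the perturbed leader: subtract fresh exponential
            # noise from each estimate and pull the smallest
            perturbed = [est[i] - rng.expovariate(eta) for i in range(k)]
            arm = min(range(k), key=perturbed.__getitem__)
            c = get_cost(t, arm)
        total += c
    return total

# e.g. an oblivious cost sequence:
# bandit_fpl(lambda t, a: (t + a) % 2, k=5, T=1000)
```

On exploration rounds the estimate c·k/γ is unbiased for the true cost, while on exploitation rounds the fresh exponential perturbation supplies the randomization behind fpl-style regret guarantees.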
Similar resources
FPL Analysis for Adaptive Bandits
A main problem of “Follow the Perturbed Leader” strategies for online decision problems is that regret bounds are typically proven against an oblivious adversary. In partial-observation cases it was not clear how to obtain performance guarantees against an adaptive adversary without worsening the bounds. We propose a conceptually simple argument to resolve this problem. Using this, a regret bound o...
Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits
Recent work on follow the perturbed leader (FTPL) algorithms for the adversarial multi-armed bandit problem has highlighted the role of the hazard rate of the distribution generating the perturbations. Assuming that the hazard rate is bounded, it is possible to provide regret analyses for a variety of FTPL algorithms for the multi-armed bandit problem. This paper pushes the inquiry into regret ...
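As a small illustration of the hazard-rate condition, the sketch below checks that exponential perturbations have a constant (hence bounded) hazard rate and uses them in a single FTPL draw; the function names are illustrative, not from the paper.

```python
import math
import random

# hazard rate h(x) = f(x) / (1 - F(x)); for Exponential(lam) it is
# the constant lam, so the bounded-hazard-rate condition holds
def exp_hazard(x, lam=1.0):
    pdf = lam * math.exp(-lam * x)
    survival = math.exp(-lam * x)        # 1 - F(x)
    return pdf / survival                # == lam for every x >= 0

def ftpl_draw(est_losses, rng, lam=1.0):
    """One FTPL decision: perturb each cumulative loss estimate with
    fresh Exponential noise and follow the perturbed leader."""
    noisy = [L - rng.expovariate(lam) for L in est_losses]
    return min(range(len(noisy)), key=noisy.__getitem__)

rng = random.Random(0)
print(exp_hazard(2.5))                   # -> 1.0
print(ftpl_draw([3.0, 2.0, 2.1], rng))   # usually arm 1
```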
The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits
We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties:
1. No knowledge of a time horizon T is necessary.
2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class.
3. The regret scales as O(T^(2/3) S^(1/3)) or better (sometimes, much better). Here S is th...
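A hedged skeleton of an epoch-greedy style loop, under the assumption that exploration data is importance-weighted and a hypothesis is refit each epoch; get_context, get_reward, learn, and exploit_steps are hypothetical callables, not the paper's interface.

```python
import random

def epoch_greedy(get_context, get_reward, k, n_epochs,
                 learn, exploit_steps, seed=0):
    """Epoch-Greedy style skeleton: each epoch spends one uniformly
    random exploration step, refits a hypothesis on the exploration
    data, then exploits it for a while."""
    rng = random.Random(seed)
    data = []                            # (context, arm, iw_reward)
    for epoch in range(1, n_epochs + 1):
        x = get_context()
        arm = rng.randrange(k)           # exploration: uniform arm
        data.append((x, arm, get_reward(x, arm) * k))  # importance weight
        h = learn(data)                  # refit on exploration data only
        # the number of exploitation steps can grow with the epoch,
        # which is why no horizon T is needed in advance
        for _ in range(exploit_steps(epoch)):
            x = get_context()
            get_reward(x, h(x))          # play the learned hypothesis

# e.g. with trivial stand-ins:
# epoch_greedy(lambda: 0, lambda x, a: random.random(), k=3,
#              n_epochs=10, learn=lambda d: (lambda x: 0),
#              exploit_steps=lambda ep: ep)
```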
Multi-armed Bandits with Constrained Arms and Hidden States
The problem of rested and restless multi-armed bandits with constrained availability of arms is considered. The states of the arms evolve in a Markovian manner, and the exact states are hidden from the decision maker. First, some structural results on the value functions are claimed. Following these results, the optimal policy turns out to be a threshold policy. Further, indexability of rested bandits is ...
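For intuition on threshold policies over hidden states, here is a minimal sketch for a single two-state hidden Markov arm, assuming the belief is propagated through known transition probabilities; this is illustrative only, not the paper's construction.

```python
def belief_update(b, p01, p11):
    """Propagate the belief that a two-state hidden Markov arm is in
    the 'good' state through one unobserved transition."""
    return b * p11 + (1 - b) * p01

def pull_arm(belief, threshold):
    """Threshold policy: pull the arm iff the belief in the good
    state exceeds the threshold."""
    return belief > threshold

b = 0.3
for _ in range(5):                       # belief drifts toward its
    b = belief_update(b, p01=0.2, p11=0.8)   # stationary point 0.5
print(b, pull_arm(b, threshold=0.5))
```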
Modal Bandits
Analyses of multi-armed bandits primarily presume that the value of an arm is its expected reward. We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions.
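A toy illustration of the distinction: the empirical mode of an arm's observed rewards can rank arms differently than the empirical mean.

```python
from collections import Counter

def empirical_mode(samples):
    """Most frequent observed reward: under the modal-bandit view the
    value of an arm is the mode of its reward distribution."""
    return Counter(samples).most_common(1)[0][0]

# the mode and the mean can disagree about which arm is best:
print(empirical_mode([0, 0, 5, 0, 1]))   # -> 0, although the mean is 1.2
```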
Publication date: 2007